Overview

Before any further work, an overview of the dataset's dimensions and column types helps us understand what we are dealing with.

dim(USvideos)
[1] 40949    16
str(USvideos)
Classes ‘spec_tbl_df’, ‘tbl_df’, ‘tbl’ and 'data.frame':    40949 obs. of  16 variables:
 $ video_id              : chr  "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
 $ trending_date         : chr  "17.14.11" "17.14.11" "17.14.11" "17.14.11" ...
 $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
 $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
 $ category_id           : num  22 24 23 24 24 28 24 28 1 25 ...
 $ publish_time          : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
 $ tags                  : chr  "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
 $ views                 : num  748374 2418783 3191434 343168 2095731 ...
 $ likes                 : num  57527 97185 146033 10172 132235 ...
 $ dislikes              : num  2966 6146 5339 666 1989 ...
 $ comment_count         : num  15954 12703 8181 2146 17518 ...
 $ thumbnail_link        : chr  "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
 $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ description           : chr  "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...
 - attr(*, "problems")=Classes ‘tbl_df’, ‘tbl’ and 'data.frame':    1533544 obs. of  5 variables:
  ..$ row     : int  2 2 2 2 2 2 3 3 3 3 ...
  ..$ col     : chr  "tags" "tags" "tags" "tags" ...
  ..$ expected: chr  "delimiter or quote" "delimiter or quote" "delimiter or quote" "delimiter or quote" ...
  ..$ actual  : chr  "|" "l" "|" "j" ...
  ..$ file    : chr  "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" "'data/USvideos.csv'" ...
 - attr(*, "spec")=
  .. cols(
  ..   video_id = col_character(),
  ..   trending_date = col_character(),
  ..   title = col_character(),
  ..   channel_title = col_character(),
  ..   category_id = col_double(),
  ..   publish_time = col_datetime(format = ""),
  ..   tags = col_character(),
  ..   views = col_double(),
  ..   likes = col_double(),
  ..   dislikes = col_double(),
  ..   comment_count = col_double(),
  ..   thumbnail_link = col_character(),
  ..   comments_disabled = col_logical(),
  ..   ratings_disabled = col_logical(),
  ..   video_error_or_removed = col_logical(),
  ..   description = col_character()
  .. )

Assert outliers

Now we need to check whether the dataset contains any outliers or mistakes.

Assert category_id

First, check the “category_id” column. There are 43 categories, so every value in the column should lie between 1 and 43.

assert(data = USvideos, in_set(1, 43, allow.na = FALSE), category_id) 
Column 'category_id' violates assertion 'in_set(1, 43, allow.na = FALSE)' 38547 times
  [omitted 38542 rows]
Error: assertr stopped execution

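Note that in_set(1, 43) tests membership in the two-element set {1, 43}, not the range from 1 to 43, so most of the 38547 reported violations are valid IDs such as 22 or 24 rather than errors. A range check would use within_bounds(1, 43, allow.na = FALSE) instead. The five rows printed before the omitted ones are only the first violations, so this output does not by itself tell us how many values are NA. Any rows with NA in this column can simply be removed later.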
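The difference between a membership test and a range test can be seen on a small vector. This is a minimal sketch, assuming the assertr package is installed; the IDs in `ids` are made up for illustration and are not taken from USvideos.

```r
library(assertr)

# Hypothetical category IDs, including one missing value.
ids <- c(1, 22, 43, NA)

# in_set(1, 43) builds a predicate that accepts only the listed values 1 and 43.
member_check <- in_set(1, 43, allow.na = FALSE)
member_result <- member_check(ids)

# within_bounds(1, 43) builds a predicate that accepts the closed range [1, 43].
range_check <- within_bounds(1, 43, allow.na = FALSE)
range_result <- range_check(ids)

# Missing values can be counted directly, independent of either predicate.
na_count <- sum(is.na(ids))

print(member_result)
print(range_result)
print(na_count)
```

With `allow.na = FALSE`, both predicates reject the NA entry, but only the range predicate accepts the ID 22.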

Assert numerical columns

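For the numerical columns in the dataset (views, likes, dislikes and comment_count), every value should be non-negative, since these are counts. Zero is allowed because lower.bound = 0 is inclusive.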

assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), views)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), likes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), dislikes)
assert(data = USvideos, within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE), comment_count)

Fortunately, all four assertions pass: none of these columns contains a negative or missing value.
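Since assert() accepts several column names after the predicate, the four checks could also be written as one call. The sketch below assumes assertr is installed and uses a small made-up data frame (`toy`) in place of USvideos; it also shows how a failing check can be caught instead of stopping the script.

```r
library(assertr)

# Hypothetical counts, standing in for the real columns.
toy <- data.frame(
  views = c(10, 0, 250),
  likes = c(1, 0, 30),
  dislikes = c(0, 0, 2),
  comment_count = c(3, 0, 9)
)

# One predicate applied to all four columns; on success the data is returned.
non_negative <- within_bounds(lower.bound = 0, upper.bound = Inf, allow.na = FALSE)
checked <- assert(toy, non_negative, views, likes, dislikes, comment_count)

# Introduce a negative value and catch the error raised by the failed assertion.
bad <- toy
bad$likes[2] <- -5
outcome <- tryCatch(
  {
    assert(bad, non_negative, views, likes, dislikes, comment_count)
    "passed"
  },
  error = function(e) "failed"
)
print(outcome)
```

Keeping the predicate in a variable avoids repeating the bounds and makes it harder for one column to be checked with slightly different settings.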

Assert logical columns

For the logical columns, every value should be TRUE or FALSE, with no NA.

assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), comments_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), ratings_disabled)
assert(data = USvideos, in_set(TRUE, FALSE, allow.na = FALSE), video_error_or_removed)

These assertions pass as well, so the logical columns contain no errors.
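Because a column already parsed as logical can only hold TRUE, FALSE or NA, the check above effectively reduces to a missing-value check. A minimal base R sketch of that idea, using a made-up vector `flags` rather than the real columns:

```r
# Hypothetical flag column with one missing value.
flags <- c(FALSE, TRUE, NA, FALSE)

# Confirm the type, then look for missing entries.
is_logical_type <- is.logical(flags)
has_missing <- anyNA(flags)
missing_positions <- which(is.na(flags))

print(is_logical_type)
print(has_missing)
print(missing_positions)
```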

Clean the data

Remove NA

Only a small share of observations contain NA values: the row count drops from 40949 to 40371 after this step, that is 578 rows, or about 1.4%. We can therefore simply remove every row that has an NA in any column.

USvideos_NNA <- as.data.frame(na.omit(USvideos))
USvideos_NNA
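Before dropping rows, it can help to see which columns the missing values come from and how many rows will be lost. A minimal base R sketch, using a made-up data frame `toy` in place of USvideos:

```r
# Hypothetical data with missing values in two different columns.
toy <- data.frame(
  category_id = c(22, NA, 24, 1),
  description = c("a", "b", NA, "d"),
  stringsAsFactors = FALSE
)

# Number of missing values in each column.
na_per_column <- colSums(is.na(toy))

# Number of rows with at least one NA; these are the rows na.omit() drops.
rows_with_na <- sum(!complete.cases(toy))

# Drop them and confirm the row count changed by the expected amount.
cleaned <- na.omit(toy)
rows_removed <- nrow(toy) - nrow(cleaned)

print(na_per_column)
print(rows_removed)
```

Note that na.omit() removes a row if any column is missing, so a mostly empty free-text column can cost rows whose other fields are complete.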

Convert date column

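Next, we convert the “trending_date” column from character to a proper Date. The raw strings are in year.day.month order (for example, “17.14.11” is 14 November 2017), so the matching parser is ydm() from the “lubridate” package.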

USvideos_NNA <- USvideos_NNA %>%
  mutate(trending_date = ydm(trending_date))
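The year.day.month order is easy to get wrong, so it is worth checking the parser on a couple of strings. The sketch below assumes lubridate is installed; the first string comes from the data shown earlier, the second is a made-up example. A base R equivalent with an explicit format string is included for comparison.

```r
library(lubridate)

# "17.14.11" appears in the data; "18.01.06" is a hypothetical second value.
raw <- c("17.14.11", "18.01.06")

# lubridate: parse as year, day, month.
parsed <- ydm(raw)

# Base R equivalent with the order spelled out in the format string.
base_parsed <- as.Date(raw, format = "%y.%d.%m")

print(format(parsed))
print(format(base_parsed))
```

Using ymd() here instead would try to read 14 as a month, and the first string would fail to parse.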

Result overview

Now let’s look at the structure of the cleaned dataset again.

str(USvideos_NNA)
'data.frame':   40371 obs. of  16 variables:
 $ video_id              : chr  "2kyS6SvSYSE" "1ZAPwfrtAFY" "5qpjK5DgCt4" "puqaWrEC7tY" ...
 $ trending_date         : Date, format: "2017-11-14" "2017-11-14" "2017-11-14" "2017-11-14" ...
 $ title                 : chr  "WE WANT TO TALK ABOUT OUR MARRIAGE" "The Trump Presidency: Last Week Tonight with John Oliver (HBO)" "Racist Superman | Rudy Mancuso, King Bach & Lele Pons" "Nickelback Lyrics: Real or Fake?" ...
 $ channel_title         : chr  "CaseyNeistat" "LastWeekTonight" "Rudy Mancuso" "Good Mythical Morning" ...
 $ category_id           : num  22 24 23 24 24 28 24 28 1 25 ...
 $ publish_time          : POSIXct, format: "2017-11-13 17:13:01" "2017-11-13 07:30:00" "2017-11-12 19:05:24" "2017-11-13 11:00:04" ...
 $ tags                  : chr  "SHANtell martin" "last week tonight trump presidency\"|\"last week tonight donald trump\"|\"john oliver trump\"|\"donald trump" "racist superman\"|\"rudy\"|\"mancuso\"|\"king\"|\"bach\"|\"racist\"|\"superman\"|\"love\"|\"rudy mancuso poo be"| __truncated__ "rhett and link\"|\"gmm\"|\"good mythical morning\"|\"rhett and link good mythical morning\"|\"good mythical mor"| __truncated__ ...
 $ views                 : num  748374 2418783 3191434 343168 2095731 ...
 $ likes                 : num  57527 97185 146033 10172 132235 ...
 $ dislikes              : num  2966 6146 5339 666 1989 ...
 $ comment_count         : num  15954 12703 8181 2146 17518 ...
 $ thumbnail_link        : chr  "https://i.ytimg.com/vi/2kyS6SvSYSE/default.jpg" "https://i.ytimg.com/vi/1ZAPwfrtAFY/default.jpg" "https://i.ytimg.com/vi/5qpjK5DgCt4/default.jpg" "https://i.ytimg.com/vi/puqaWrEC7tY/default.jpg" ...
 $ comments_disabled     : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ ratings_disabled      : logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ video_error_or_removed: logi  FALSE FALSE FALSE FALSE FALSE FALSE ...
 $ description           : chr  "SHANTELL'S CHANNEL - https://www.youtube.com/shantellmartin\\nCANDICE - https://www.lovebilly.com\\n\\nfilmed t"| __truncated__ "One year after the presidential election, John Oliver discusses what we've learned so far and enlists our cathe"| __truncated__ "WATCH MY PREVIOUS VIDEO ▶ \\n\\nSUBSCRIBE ► https://www.youtube.com/channel/UC5jkXpfnBhlDjqh0ir5FsIQ?sub_confir"| __truncated__ "Today we find out if Link is a Nickelback amateur or a secret Nickelback devotee. GMM #1218\\nDon't miss an all"| __truncated__ ...